SPARSE RECOVERY IN CONVEX HULLS
VIA ENTROPY PENALIZATION

By Vladimir Koltchinskii

Georgia Institute of Technology

Let $(X, Y)$ be a random couple in $S \times T$ with unknown distribution $P$ and let $(X_1,Y_1), \dots, (X_n,Y_n)$ be i.i.d. copies of $(X, Y)$. Denote by $P_n$ the empirical distribution of $(X_1,Y_1), \dots, (X_n,Y_n)$. Let $h_1, \dots, h_N : S \mapsto [-1,1]$ be a dictionary that consists of $N$ functions. For $\lambda \in \mathbb{R}^N$, denote $f_\lambda := \sum_{j=1}^N \lambda_j h_j$. Let $\ell : T \times \mathbb{R} \mapsto \mathbb{R}$ be a given loss function and suppose it is convex with respect to the second variable. Let $(\ell \bullet f)(x,y) := \ell(y, f(x))$. Finally, let $\Lambda \subset \mathbb{R}^N$ be the simplex of all probability distributions on $\{1, \dots, N\}$. Consider the following penalized empirical risk minimization problem

\[
\hat\lambda^\varepsilon := \operatorname*{argmin}_{\lambda\in\Lambda}\Bigl[P_n(\ell \bullet f_\lambda) + \varepsilon \sum_{j=1}^N \lambda_j \log \lambda_j\Bigr]
\]

along with its distribution dependent version

\[
\lambda^\varepsilon := \operatorname*{argmin}_{\lambda\in\Lambda}\Bigl[P(\ell \bullet f_\lambda) + \varepsilon \sum_{j=1}^N \lambda_j \log \lambda_j\Bigr],
\]

where $\varepsilon \ge 0$ is a regularization parameter. It is proved that the "approximate sparsity" of $\lambda^\varepsilon$ implies the "approximate sparsity" of $\hat\lambda^\varepsilon$ and the impact of "sparsity" on bounding the excess risk of the empirical solution is explored. Similar results are also discussed in the case of entropy penalized density estimation.

1. Introduction. Let $S$ and $T$ be measurable spaces with $\sigma$-algebras $\mathcal{S}$ and $\mathcal{T}$, respectively, and let $(X, Y)$ be a random couple in $S \times T$. The distribution of $(X, Y)$ will be denoted by $P$ and the distribution of $X$ by $\Pi$. The training data $(X_1, Y_1), \dots, (X_n, Y_n)$ consists of $n$ i.i.d. copies of $(X, Y)$ (the distribution $P$ is not known and it is to be estimated based on the data). We will denote by $P_n$ the empirical distribution of the data and will write in what follows

\[
Pg = \mathbb{E}\,g(X,Y) \quad\text{and}\quad P_n g = n^{-1}\sum_{i=1}^n g(X_i, Y_i)
\]

for functions $g$ on $S \times T$ (as well as for functions on $S$, since they can also be viewed as functions on $S \times T$).

We will be interested in a class of prediction problems in which $Y$ is to be predicted based on an observation of $X$. Prediction rules will be based on the training data $(X_1,Y_1), \dots, (X_n, Y_n)$.

Let $\ell : T \times \mathbb{R} \mapsto \mathbb{R}_+$ be a loss function. It will be assumed in what follows that, for all $y \in T$, $\ell(y, \cdot)$ is convex. For a function $f : S \mapsto \mathbb{R}$, let $(\ell \bullet f)(x, y) := \ell(y, f(x))$. Then the quantity $P(\ell \bullet f)$ is the (true) risk of the prediction rule $f$ and $P_n(\ell \bullet f)$ is the corresponding empirical risk. The excess risk of $f$ is defined as

\[
\mathcal{E}(f) := P(\ell \bullet f) - \inf_{g} P(\ell \bullet g) = P(\ell \bullet f) - P(\ell \bullet f_*),
\]

where the infimum is taken over all measurable functions and it is assumed for simplicity that it is attained at $f_* \in L_2(\Pi)$ (moreover, it will be assumed in what follows that $f_*$ is uniformly bounded by a constant $M$).
Let

\[
\mathcal{H} := \{h_1, \dots, h_N\}
\]

be a given finite class of measurable functions from $S$ into $[-1,1]$ called a dictionary (of course, it can be assumed instead that the functions in the dictionary are uniformly bounded by an arbitrary constant; the only change will be in the constants in the results below). The dictionary can be an orthonormal system of functions, a union of several orthonormal systems suitable for approximation of the target function $f_*$, a base class of a boosting type algorithm, a set of pretrained estimators in an aggregation problem, etc. Let $\mathcal{P}(\mathcal{H})$ be the set of all probability measures on $\mathcal{H}$. For $\lambda \in \mathcal{P}(\mathcal{H})$, denote $\lambda_j := \lambda(\{h_j\})$ and

\[
f_\lambda := \sum_{j=1}^N \lambda_j h_j.
\]

Denote $\Lambda := \{(\lambda_1, \dots, \lambda_N) : \lambda_j \ge 0,\ j = 1, \dots, N,\ \sum_{j=1}^N \lambda_j = 1\}$. We will identify (whenever it is convenient) probability measures $\lambda \in \mathcal{P}(\mathcal{H})$ with vectors $(\lambda_1, \dots, \lambda_N)$ from the simplex $\Lambda$. We will write (with a little abuse of notation) $\lambda = (\lambda_1, \dots, \lambda_N)$. Clearly, the function $f_\lambda : S \mapsto [-1,1]$ is a convex combination (a mixture) of functions from the dictionary and the set

\[
\operatorname{conv}(\mathcal{H}) := \{f_\lambda : \lambda \in \mathcal{P}(\mathcal{H})\}
\]

is the convex hull of $\mathcal{H}$.

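As always, define the entropy of $\lambda$ as

\[
H(\lambda) = -\sum_{j=1}^N \lambda_j \log \lambda_j.
\]

The Kullback-Leibler divergence between $\lambda, \nu \in \Lambda$ is defined as

\[
K(\lambda \mid \nu) := \sum_{j=1}^N \lambda_j \log\Bigl(\frac{\lambda_j}{\nu_j}\Bigr).
\]

Denote

\[
K(\lambda, \nu) := K(\lambda \mid \nu) + K(\nu \mid \lambda).
\]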
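As a concrete illustration of these definitions, the following minimal Python sketch evaluates the entropy, the Kullback-Leibler divergence and its symmetrized version for vectors in the simplex. The function names `entropy`, `kl` and `sym_kl` and the two example vectors are illustrative choices, not notation from the paper; the convention $0 \log 0 = 0$ is assumed.

```python
import math

def entropy(lam):
    """H(lambda) = -sum_j lambda_j log lambda_j, with the convention 0 log 0 = 0."""
    return -sum(l * math.log(l) for l in lam if l > 0.0)

def kl(lam, nu):
    """K(lambda | nu) = sum_j lambda_j log(lambda_j / nu_j).

    Returns +inf when lambda charges an atom that nu does not.
    """
    total = 0.0
    for l, v in zip(lam, nu):
        if l > 0.0:
            if v <= 0.0:
                return math.inf
            total += l * math.log(l / v)
    return total

def sym_kl(lam, nu):
    """Symmetrized divergence K(lambda, nu) = K(lambda | nu) + K(nu | lambda)."""
    return kl(lam, nu) + kl(nu, lam)

# Illustrative vectors in the simplex (hypothetical values).
N = 4
uniform = [1.0 / N] * N
nearly_sparse = [0.97, 0.01, 0.01, 0.01]
```

The uniform distribution maximizes the entropy (its value is $\log N$), and for any $\lambda$ in the simplex one has $K(\lambda \mid \text{uniform}) = \log N - H(\lambda)$, which is one way to see why penalizing by $-H(\lambda)$ pulls solutions toward the uniform distribution.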
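The following penalized empirical risk minimization problem will be studied:

\[
\hat\lambda^\varepsilon := \operatorname*{argmin}_{\lambda\in\mathcal{P}(\mathcal{H})}\bigl[P_n(\ell \bullet f_\lambda) - \varepsilon H(\lambda)\bigr]
= \operatorname*{argmin}_{\lambda\in\Lambda}\Bigl[P_n(\ell \bullet f_\lambda) + \varepsilon\sum_{j=1}^N \lambda_j \log \lambda_j\Bigr], \tag{1.1}
\]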

where $\varepsilon \ge 0$ is a regularization parameter. Since, for all $y$, $\ell(y,\cdot)$ is convex, the empirical risk $P_n(\ell \bullet f_\lambda)$ is a convex function of $\lambda$. Since also the set $\mathcal{P}(\mathcal{H})$ is convex (it can be identified with the simplex $\Lambda$) and the function $\lambda \mapsto -H(\lambda)$ is convex, this makes the problem (1.1) a convex optimization problem. It is natural to compare this problem with its distribution dependent version



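\[
\lambda^\varepsilon := \operatorname*{argmin}_{\lambda\in\mathcal{P}(\mathcal{H})}\bigl[P(\ell \bullet f_\lambda) - \varepsilon H(\lambda)\bigr]
= \operatorname*{argmin}_{\lambda\in\Lambda}\Bigl[P(\ell \bullet f_\lambda) + \varepsilon\sum_{j=1}^N \lambda_j \log \lambda_j\Bigr]. \tag{1.2}
\]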
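Since (1.1) is a convex problem over the simplex, it can be solved numerically by standard first-order methods. The sketch below is one possible approach and not a procedure taken from the paper: it uses exponentiated-gradient (entropic mirror descent) updates for the quadratic loss. The names `objective`, `gradient`, `entropy_penalized_erm`, the step size `eta`, the iteration count and the toy data `H_demo`, `y_demo` are all assumptions made for illustration. Here `H[i][j]` stands for $h_j(X_i)$.

```python
import math

def objective(lam, H, y, eps):
    """Empirical quadratic risk of f_lambda plus eps * sum_j lambda_j log lambda_j."""
    n = len(y)
    risk = 0.0
    for row, yi in zip(H, y):
        pred = sum(l * h for l, h in zip(lam, row))
        risk += (yi - pred) ** 2
    penalty = sum(l * math.log(l) for l in lam if l > 0.0)
    return risk / n + eps * penalty

def gradient(lam, H, y, eps):
    """Gradient of the penalized empirical risk at an interior point of the simplex."""
    n, N = len(y), len(lam)
    grad = [0.0] * N
    for row, yi in zip(H, y):
        pred = sum(l * h for l, h in zip(lam, row))
        r = 2.0 * (pred - yi) / n
        for j in range(N):
            grad[j] += r * row[j]
    for j in range(N):
        grad[j] += eps * (math.log(lam[j]) + 1.0)
    return grad

def entropy_penalized_erm(H, y, eps, steps=5000, eta=0.2):
    """Exponentiated-gradient iterations started from the uniform distribution."""
    N = len(H[0])
    lam = [1.0 / N] * N
    for _ in range(steps):
        g = gradient(lam, H, y, eps)
        m = min(g)  # shift the exponent for numerical stability
        w = [l * math.exp(-eta * (gj - m)) for l, gj in zip(lam, g)]
        z = sum(w)
        lam = [max(wj / z, 1e-300) for wj in w]  # keep log(lam_j) finite
    return lam

# Toy data (hypothetical): the response coincides with the first dictionary function.
H_demo = [
    [1.0, -1.0, 0.5, 0.0],
    [-1.0, 1.0, 0.5, 0.0],
    [1.0, 1.0, -0.5, 0.0],
    [-1.0, -1.0, -0.5, 0.0],
    [1.0, 0.0, 1.0, -1.0],
    [-1.0, 0.0, -1.0, 1.0],
]
y_demo = [row[0] for row in H_demo]
lam_demo = entropy_penalized_erm(H_demo, y_demo, eps=0.01)
```

The multiplicative update keeps every coordinate strictly positive, which matches the remark that solutions of the entropy penalized problem can only be "approximately sparse": in this toy example most of the mass ends up on the first atom, but no weight is exactly zero.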

In the recent literature, there has been considerable attention to the problem of sparse recovery in a linear span of a given dictionary using penalized empirical risk minimization with $\ell_1$-penalty (this method is called LASSO in the literature on regression), and the current paper is close to this line of work. It has become clear that sparse recovery is not always possible, but only under some geometric assumptions on the dictionary. These assumptions are often described in terms of the properties of the Gram matrix of the dictionary, which in the case of random design models is the matrix

\[
H := \bigl(\langle h_i, h_j\rangle_{L_2(\Pi)}\bigr)_{i,j=1,\dots,N},
\]

and they take the form of various conditions on the entries of this matrix ("coherence coefficients"), or on its submatrices (in the spirit of "uniform uncertainty principle" or "restricted isometry" conditions). The essence of these assumptions is to try to keep the dictionary not too far from being orthonormal in $L_2(\Pi)$, which in some sense is an ideal case for sparse recovery [see, e.g., Donoho (2006), Candes and Tao (2007), Rudelson and Vershynin (2005), Mendelson, Pajor and Tomczak-Jaegermann (2007), Bunea, Tsybakov and Wegkamp (2007a), van de Geer (2008), Koltchinskii (2008a, 2008b) and Bickel, Ritov and Tsybakov (2008), among many other papers that study both the random design and the fixed design problems].

The idea of using the entropy for complexity regularization is not new in information theory and statistics (recall, e.g., the principle of maximum entropy). In particular, it has been studied recently in connection with the problem of aggregation of statistical estimators by exponential weighting and also in a large number of papers on the PAC-Bayesian approach in learning theory [see, e.g., McAllester (1999), Catoni (2004), Audibert (2004), Zhang (2001, 2006a, 2006b) and references therein]. However, we are not aware of any attempt to relate this penalization technique to sparse recovery problems, with the exception of a very recent paper by Dalalyan and Tsybakov (2007), where it is done in the context of aggregation with exponential weighting. Moreover, at least at first glance, the idea of using this type of penalization to achieve sparse recovery seems counterintuitive since the penalty $-H(\lambda)$ attains its minimum at the uniform distribution $\lambda_j = N^{-1}$, $j = 1, \dots, N$, and, from this point of view, it penalizes for "sparsity" rather than for "nonsparsity" [in fact, solutions of (1.1), (1.3) can be only "approximately sparse"].

In this paper we follow the approach of Koltchinskii (2005, 2008a), where the problem was studied in the case of $\ell_p$-penalization with $1 \le p \le 1 + \frac{1}{\log N}$.
This approach is based on a separate study of the random error $|\mathcal{E}(f_{\hat\lambda^\varepsilon}) - \mathcal{E}(f_{\lambda^\varepsilon})|$ and of the approximation error $\mathcal{E}(f_{\lambda^\varepsilon})$. It happens that these are two different problems with not entirely the same geometric parameters responsible for the size of each of the two errors, and the geometry of the problem is more subtle than in the standard approach based on conditions on the Gram matrix $H$. In many problems in Statistics and Learning Theory the distribution of the design variable is completely unknown and it is unrealistic to make any restrictive assumptions on its Gram matrix. For this reason, it is desirable to study in a more precise way how the excess risk of the solution of (1.1) depends on the geometric parameters of the problem.

One of our goals is to show that if $\lambda^\varepsilon$ is "approximately sparse" (i.e., this measure is almost concentrated on a small set of atoms), then a similar property is enjoyed by $\hat\lambda^\varepsilon$. These sparsity bounds provide a way to control $\|f_{\hat\lambda^\varepsilon} - f_{\lambda^\varepsilon}\|_{L_2(\Pi)}$ and $K(\hat\lambda^\varepsilon, \lambda^\varepsilon)$ (see Theorems 1 and 2). For instance, we show that for any set $J \subset \{1, \dots, N\}$ with $\operatorname{card}(J) = d$ and such that

\[
\sum_{j\notin J}\lambda_j^\varepsilon \le \sqrt{\frac{\log N}{n}},
\]
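the following bound holds with a high probability:

\[
\|f_{\hat\lambda^\varepsilon} - f_{\lambda^\varepsilon}\|^2_{L_2(\Pi)} + \varepsilon K(\hat\lambda^\varepsilon; \lambda^\varepsilon) \le C\,\frac{d + \log N}{n}.
\]

This also allows us to bound "the random error" $|\mathcal{E}(f_{\hat\lambda^\varepsilon}) - \mathcal{E}(f_{\lambda^\varepsilon})|$ in terms of "approximate sparsity" of the problem (Theorem 3).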

Some further geometric parameters (such as "the alignment coefficient" introduced in the next section) provide a way to control "the approximation error" $\mathcal{E}(f_{\lambda^\varepsilon})$ (see Theorem 4). Namely, suppose there exists a vector $\lambda \in \Lambda$ with the following properties:

(i) $\lambda$ is "sparse" [i.e., its support $J = \operatorname{supp}(\lambda)$ is a set of relatively small cardinality];

(ii) the excess risk $\mathcal{E}(f_\lambda)$ is small;

(iii) $\lambda$ is "aligned" nicely with the dictionary (the precise definitions are given in the next section).

Then $\lambda^\varepsilon$ is approximately sparse and its excess risk $\mathcal{E}(f_{\lambda^\varepsilon})$ is small (more precisely, its size is controlled by the sparsity of $\lambda$ and its "alignment" with the dictionary). These results ultimately yield oracle inequalities on the excess risk $\mathcal{E}(f_{\hat\lambda^\varepsilon})$ showing that this estimation method provides a certain degree of adaptation to the unknown "sparsity" of the problem (see Corollary 1).

The density estimation problem can also be studied rather naturally in a similar framework. In this problem, the data consists of $n$ independent identically distributed observations $X_1, \dots, X_n$ in $S$ with common distribution $P$. Suppose that $P$ has density $f_*$ with respect to a $\sigma$-finite measure $\mu$ in $(S, \mathcal{A})$. We will assume that $f_*$ is uniformly bounded by a constant $M$. Let $h_1, \dots, h_N$ be a large dictionary of probability densities with respect to $\mu$ uniformly bounded by 1 (as in the case of the prediction problem discussed above, one can assume that these densities are uniformly bounded by an arbitrary constant, resulting in a proper change of constants in the theorems). The goal is to construct an estimator of $f_*$ in the class of mixtures $\{f_\lambda : \lambda \in \Lambda\}$. The underlying assumption is that there exists a "sparse" mixture that approximates the unknown density reasonably well. One can use an estimator based on minimizing the entropy penalized empirical risk with respect to quadratic loss:

\[
\hat\lambda^\varepsilon := \operatorname*{argmin}_{\lambda\in\Lambda}\Bigl[\|f_\lambda\|^2_{L_2(\mu)} - 2P_n f_\lambda + \varepsilon\sum_{j=1}^N \lambda_j \log \lambda_j\Bigr], \tag{1.3}
\]
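which is again a convex minimization problem. The corresponding penalized true risk minimization problem is

\[
\lambda^\varepsilon := \operatorname*{argmin}_{\lambda\in\Lambda}\bigl[\|f_\lambda - f_*\|^2_{L_2(\mu)} - \varepsilon H(\lambda)\bigr]
= \operatorname*{argmin}_{\lambda\in\Lambda}\Bigl[\|f_\lambda\|^2_{L_2(\mu)} - 2P f_\lambda + \varepsilon\sum_{j=1}^N \lambda_j \log \lambda_j\Bigr]. \tag{1.4}
\]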



Recently, Bunea, Tsybakov and Wegkamp (2007b) studied a similar density estimation problem with $\ell_1$-penalized empirical risk with respect to quadratic loss (and for linear aggregation instead of convex aggregation). As in the case of prediction problems (regression, classification), we also obtain bounds characterizing approximate sparsity of the empirical solution in terms of approximate sparsity of the true solution and oracle inequalities for $\|f_{\hat\lambda^\varepsilon} - f_*\|^2_{L_2(\mu)}$ (which is equivalent to considering the excess risk in this problem; see Theorems 5-7, Corollary 2).

2. Main results. 

2.1. Assumptions on the loss. We assume below the following properties of the loss function $\ell$: for all $y \in T$, $\ell(y,\cdot)$ is twice differentiable, $\ell''_u$ is a uniformly bounded function in $T \times \mathbb{R}$ and

\[
\sup_{y\in T}\ell(y; 0) < +\infty, \qquad \sup_{y\in T}|\ell'_u(y; 0)| < +\infty.
\]

Moreover, denote

\[
\tau(R) := \frac{1}{2}\inf_{y\in T}\,\inf_{|u|\le R}\ell''_u(y,u). \tag{2.1}
\]

It will be assumed that

\[
\tau(M \vee 1) > 0
\]

(recall that $M$ is a constant such that $\|f_*\|_\infty \le M$). Without loss of generality, we also assume that $\tau(R) \le 1$, $R > 0$ (otherwise, it can be replaced by a lower bound).

There are many important examples of loss functions satisfying these assumptions, most notably, the quadratic loss $\ell(y,u) := (y - u)^2$ in the case when $T \subset \mathbb{R}$ is a bounded set. In this case, $\tau = 1$. In regression problems with a bounded response variable, one can also consider more general loss functions of the form $\ell(y,u) := \phi(y - u)$, where $\phi$ is an even nonnegative convex twice continuously differentiable function with $\phi''$ uniformly bounded in $\mathbb{R}$, $\phi(0) = 0$ and $\phi''(u) > 0$, $u \in \mathbb{R}$. In the binary classification setting (i.e., when $T = \{-1, 1\}$), one can choose the loss $\ell(y,u) = \phi(yu)$ with $\phi$ being a nonnegative decreasing convex twice continuously differentiable function such that $\phi''$ is uniformly bounded in $\mathbb{R}$ and $\phi''(u) > 0$, $u \in \mathbb{R}$. The loss function $\phi(u) = \log_2(1 + e^{-u})$ (often called the logit loss) is a typical example.

Note that the condition that the second derivative $\ell''_u$ is uniformly bounded in $T \times \mathbb{R}$ can be replaced by its uniform boundedness in $T \times [-M \vee 1, M \vee 1]$. The constants in the theorems below will then depend on the sup-norm of the second derivative (and, as a consequence, on $M$); otherwise, the results will be the same. This allows one to cover several other popular choices of the loss function, such as the exponential loss $\ell(y,u) := e^{-yu}$ in binary classification.
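To make the quantity $\tau(R)$ from (2.1) concrete, the following Python sketch approximates it on a grid for the three losses mentioned above. It is only an illustration: the helper `tau`, the grid size and the functions `quadratic_dd`, `logit_dd` and `exponential_dd` (second derivatives in $u$, computed by hand) are names introduced here, not taken from the paper.

```python
import math

def tau(second_derivative, R, ys, grid=2001):
    """Grid approximation of tau(R) = (1/2) inf_{y in ys} inf_{|u| <= R} l''_u(y, u)."""
    us = [-R + 2.0 * R * k / (grid - 1) for k in range(grid)]
    return 0.5 * min(second_derivative(y, u) for y in ys for u in us)

def quadratic_dd(y, u):
    # l(y, u) = (y - u)^2, so l''_u = 2 for every y and u.
    return 2.0

def logit_dd(y, u):
    # l(y, u) = log2(1 + exp(-y u)); l''_u = y^2 s (1 - s) / ln 2 with s = sigmoid(y u).
    s = 1.0 / (1.0 + math.exp(-y * u))
    return (y * y) * s * (1.0 - s) / math.log(2.0)

def exponential_dd(y, u):
    # l(y, u) = exp(-y u); l''_u = y^2 exp(-y u).
    return (y * y) * math.exp(-y * u)

tau_quadratic = tau(quadratic_dd, 5.0, [0.0])
tau_logit_1 = tau(logit_dd, 1.0, [-1.0, 1.0])
tau_logit_3 = tau(logit_dd, 3.0, [-1.0, 1.0])
tau_exp_1 = tau(exponential_dd, 1.0, [-1.0, 1.0])
```

For the quadratic loss this gives $\tau = 1$, in agreement with the text. For the logit and exponential losses $\tau(R)$ is positive for every finite $R$ but decreases as $R$ grows, which is why the assumption is stated at the specific level $R = M \vee 1$.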

We will also assume in what follows that $N \ge (\log n)^{\gamma}$ for some $\gamma > 0$ (this is needed only to avoid additional terms of the order $\frac{\log\log n}{n}$ in several inequalities).



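2.2. Sparsity bounds. Our first goal is to provide upper bounds on $\|f_{\hat\lambda^\varepsilon} - f_{\lambda^\varepsilon}\|_{L_2(\Pi)}$, on $K(\hat\lambda^\varepsilon, \lambda^\varepsilon)$ and on $\sum_{j\notin J}\hat\lambda_j^\varepsilon$ for an arbitrary subset $J \subset \{1, \dots, N\}$, in terms of the cardinality of this set $d = \operatorname{card}(J)$ and the measure $\sum_{j\notin J}\lambda_j^\varepsilon$. The idea is to show that if $\lambda^\varepsilon$ is approximately sparse, that is, there exists a small set $J$ such that $\sum_{j\notin J}\lambda_j^\varepsilon$ is also small, then $\hat\lambda^\varepsilon$ is approximately sparse, too, with a high probability, and the $L_2$-error of approximation of $f_{\lambda^\varepsilon}$ by $f_{\hat\lambda^\varepsilon}$ as well as the Kullback-Leibler error of approximation of $\lambda^\varepsilon$ by $\hat\lambda^\varepsilon$ are small. The first result in this direction is the following theorem.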



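Theorem 1. There exist constants $D > 0$ and $C > 0$ depending only on $\ell$ such that, for all $J \subset \{1, \dots, N\}$ with $d := d(J) = \operatorname{card}(J)$, for all $A \ge 1$ and for all

\[
\varepsilon \ge D\sqrt{\frac{d + A\log N}{n}}, \tag{2.2}
\]

the following bounds hold with probability at least $1 - N^{-A}$:

\[
\sum_{j\notin J}\hat\lambda_j^\varepsilon \le C\Bigl[\sum_{j\notin J}\lambda_j^\varepsilon + \sqrt{\frac{d + A\log N}{n}}\Bigr],
\qquad
\sum_{j\notin J}\lambda_j^\varepsilon \le C\Bigl[\sum_{j\notin J}\hat\lambda_j^\varepsilon + \sqrt{\frac{d + A\log N}{n}}\Bigr]
\]

and

\[
\|f_{\hat\lambda^\varepsilon} - f_{\lambda^\varepsilon}\|^2_{L_2(\Pi)} + \varepsilon K(\hat\lambda^\varepsilon, \lambda^\varepsilon)
\le C\Bigl[\frac{d + A\log N}{n} \vee \sum_{j\notin J}\lambda_j^\varepsilon\sqrt{\frac{d + A\log N}{n}}\Bigr].
\]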
Note that these bounds hold without any conditions on the dictionary (except the assumption that the functions $h_j$ are uniformly bounded). However, the result is true only for $\varepsilon \ge D\sqrt{\frac{d + A\log N}{n}}$. Since it is not known for which set $J$ the sum $\sum_{j\notin J}\lambda_j^\varepsilon$ is small, it is also not known for which $d$ the condition (2.2) is to be satisfied. In other words, the regularization parameter $\varepsilon$ in this result depends on the unknown degree of sparsity of the problem.

In the next theorem, it will be assumed only that $\varepsilon \ge D\sqrt{\frac{A\log N}{n}}$, but there will be more dependence of the bounds on the geometric properties of the dictionary. On the other hand, the error will be controlled not by $d = \operatorname{card}(J)$, but rather by the dimension of a linear space $L$ that provides a good approximation of the functions $\{h_j : j \in J\}$. This dimension can be smaller than $\operatorname{card}(J)$, which makes the bound more precise. Given a subspace $L$ of $L_2(\Pi)$, define

\[
U(L) := \sup_{f\in L,\ \|f\|_{L_2(\Pi)} = 1}\|f\|_\infty + 1.
\]

It is easy to check that for any $L_2(\Pi)$-orthonormal basis $\phi_1, \dots, \phi_d$ of $L$,

\[
U(L) \le \max_{1\le j\le d}\|\phi_j\|_\infty\sqrt{d} + 1,
\]

where $d := \dim(L)$. In what follows $P_L$ denotes the orthogonal projector onto $L$ and $L^\perp$ denotes the orthogonal complement of $L$. We will be interested in subspaces $L$ for which $\dim(L)$ and $U(L)$ are not very large and, at the same time, functions $\{h_j : j \in J\}$ in the "relevant" part of the dictionary can be approximated well by the functions from $L$ in the sense that the quantity $\max_{j\in J}\|P_{L^\perp}h_j\|_{L_2(\Pi)}$ is relatively small.



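Theorem 2. Suppose that

\[
\varepsilon \ge D\sqrt{\frac{A\log N}{n}} \tag{2.3}
\]

with a large enough constant $D > 0$ depending only on $\ell$. For all $J \subset \{1, \dots, N\}$, for all subspaces $L$ of $L_2(\Pi)$ with $d := \dim(L)$ and for all $A \ge 1$, the following bounds hold with probability at least $1 - N^{-A}$ and with a constant $C > 0$ depending only on $\ell$:

\[
\sum_{j\notin J}\hat\lambda_j^\varepsilon \le C\Bigl[\sum_{j\notin J}\lambda_j^\varepsilon + \frac{d + A\log N}{n\varepsilon} + \max_{j\in J}\|P_{L^\perp}h_j\|_{L_2(\Pi)} + \frac{U(L)\log N}{n\varepsilon}\Bigr], \tag{2.4}
\]

\[
\sum_{j\notin J}\lambda_j^\varepsilon \le C\Bigl[\sum_{j\notin J}\hat\lambda_j^\varepsilon + \frac{d + A\log N}{n\varepsilon} + \max_{j\in J}\|P_{L^\perp}h_j\|_{L_2(\Pi)} + \frac{U(L)\log N}{n\varepsilon}\Bigr] \tag{2.5}
\]

and

\[
\|f_{\hat\lambda^\varepsilon} - f_{\lambda^\varepsilon}\|^2_{L_2(\Pi)} + \varepsilon K(\hat\lambda^\varepsilon, \lambda^\varepsilon)
\le C\Bigl[\frac{d + A\log N}{n} \vee \sum_{j\notin J}\lambda_j^\varepsilon\sqrt{\frac{A\log N}{n}} \vee \max_{j\in J}\|P_{L^\perp}h_j\|_{L_2(\Pi)}\sqrt{\frac{A\log N}{n}} \vee \frac{U(L)\log N}{n}\Bigr]. \tag{2.6}
\]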
If, for some J, 



'AlogiV 



and, for some L with U (L) < yd, hj £ L, j G J, the bound (2.6) simplifies 
and becomes 

It means that the fact that the dictionary is not orthogonal and even is not linearly independent might actually help to make the random errors $\|f_{\hat\lambda^\varepsilon} - f_{\lambda^\varepsilon}\|^2_{L_2(\Pi)}$ and $K(\hat\lambda^\varepsilon, \lambda^\varepsilon)$ small: their size is controlled in this case by the dimension $d$ of the linear span $L$ of the "relevant part" of the dictionary $\{h_j : j \in J\}$, and $d$ can be much smaller than $\operatorname{card}(J)$.

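2.3. Random error bounds. The following result is a simple corollary of Theorems 1, 2 and the properties of the loss function. Denote by $\mathcal{L}$ the linear span of the dictionary $\{h_1, \dots, h_N\}$ and let $P_{\mathcal{L}}$ be the orthogonal projector on $\mathcal{L} \subset L_2(P)$. Define

\[
g_\varepsilon := P_{\mathcal{L}}(\ell' \bullet f_{\lambda^\varepsilon}).
\]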

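Theorem 3. Under the conditions of Theorem 1, the following bound holds with probability at least $1 - N^{-A}$, with a constant $C > 0$ depending only on $\ell$ and with $d = \operatorname{card}(J)$:

\[
|P(\ell \bullet f_{\hat\lambda^\varepsilon}) - P(\ell \bullet f_{\lambda^\varepsilon})|
\le C\Bigl[\frac{d + A\log N}{n} \vee \sum_{j\notin J}\lambda_j^\varepsilon\sqrt{\frac{d + A\log N}{n}}
\vee \|g_\varepsilon\|_{L_2(\Pi)}\Bigl(\frac{d + A\log N}{n} \vee \sum_{j\notin J}\lambda_j^\varepsilon\sqrt{\frac{d + A\log N}{n}}\Bigr)^{1/2}\Bigr]. \tag{2.7}
\]

Similarly, under the conditions of Theorem 2, with probability at least $1 - N^{-A}$ and with $d = \dim(L)$

\[
\begin{aligned}
|P(\ell \bullet f_{\hat\lambda^\varepsilon}) - P(\ell \bullet f_{\lambda^\varepsilon})|
\le C\Bigl[&\frac{d + A\log N}{n} \vee \sum_{j\notin J}\lambda_j^\varepsilon\sqrt{\frac{A\log N}{n}} \vee \max_{j\in J}\|P_{L^\perp}h_j\|_{L_2(\Pi)}\sqrt{\frac{A\log N}{n}} \vee \frac{U(L)\log N}{n} \\
&\vee \|g_\varepsilon\|_{L_2(\Pi)}\Bigl(\frac{d + A\log N}{n} \vee \sum_{j\notin J}\lambda_j^\varepsilon\sqrt{\frac{A\log N}{n}} \vee \max_{j\in J}\|P_{L^\perp}h_j\|_{L_2(\Pi)}\sqrt{\frac{A\log N}{n}} \vee \frac{U(L)\log N}{n}\Bigr)^{1/2}\Bigr].
\end{aligned} \tag{2.8}
\]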

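Recall that $f_*$ denotes a function that minimizes the risk $P(\ell \bullet f)$ and it was assumed that $f_*$ is uniformly bounded by a constant $M$. Clearly, by necessary conditions of minimum, we have

\[
P(\ell' \bullet f_*)h_j = 0, \qquad j = 1, \dots, N,
\]

so, $\ell' \bullet f_* \in \mathcal{L}^\perp$. Note that for any function $\bar f$ uniformly bounded by $M$ and such that $\ell' \bullet \bar f \in \mathcal{L}^\perp$ (in particular, for $f_*$) we have

\[
\|g_\varepsilon\|_{L_2(\Pi)} = \|P_{\mathcal{L}}(\ell' \bullet f_{\lambda^\varepsilon})\|_{L_2(P)} = \|P_{\mathcal{L}}(\ell' \bullet f_{\lambda^\varepsilon} - \ell' \bullet \bar f)\|_{L_2(P)}
\le \|\ell' \bullet f_{\lambda^\varepsilon} - \ell' \bullet \bar f\|_{L_2(P)} \le C\|f_{\lambda^\varepsilon} - \bar f\|_{L_2(\Pi)},
\]

where we used the fact that $\ell'$ is Lipschitz with respect to the second variable. Under the conditions on the loss function, for all $\lambda \in \Lambda$

\[
\mathcal{E}(f_\lambda) \ge \tau(\|f_*\|_\infty \vee 1)\,\|f_\lambda - f_*\|^2_{L_2(\Pi)} =: \tau\|f_\lambda - f_*\|^2_{L_2(\Pi)}, \tag{2.9}
\]

which easily follows from a version of Taylor expansion for the risk.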
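To bound the excess risk $\mathcal{E}(f_{\hat\lambda^\varepsilon})$, one has to solve two different problems: bounding the random error

\[
|\mathcal{E}(f_{\hat\lambda^\varepsilon}) - \mathcal{E}(f_{\lambda^\varepsilon})| = |P(\ell \bullet f_{\hat\lambda^\varepsilon}) - P(\ell \bullet f_{\lambda^\varepsilon})|
\]

and bounding the approximation error $\mathcal{E}(f_{\lambda^\varepsilon})$. Using the above facts, one can easily get from Theorem 3 the following bounds on the random error: under the conditions of Theorem 1, with probability at least $1 - N^{-A}$ and with $d = \operatorname{card}(J)$

\[
|\mathcal{E}(f_{\hat\lambda^\varepsilon}) - \mathcal{E}(f_{\lambda^\varepsilon})|
\le C\Bigl[\frac{d + A\log N}{n} \vee \sum_{j\notin J}\lambda_j^\varepsilon\sqrt{\frac{d + A\log N}{n}}
\vee \mathcal{E}^{1/2}(f_{\lambda^\varepsilon})\Bigl(\frac{d + A\log N}{n} \vee \sum_{j\notin J}\lambda_j^\varepsilon\sqrt{\frac{d + A\log N}{n}}\Bigr)^{1/2}\Bigr] \tag{2.10}
\]

and under the conditions of Theorem 2, with probability at least $1 - N^{-A}$ and with $d = \dim(L)$

\[
\begin{aligned}
|\mathcal{E}(f_{\hat\lambda^\varepsilon}) - \mathcal{E}(f_{\lambda^\varepsilon})|
\le C\Bigl[&\frac{d + A\log N}{n} \vee \sum_{j\notin J}\lambda_j^\varepsilon\sqrt{\frac{A\log N}{n}} \vee \max_{j\in J}\|P_{L^\perp}h_j\|_{L_2(\Pi)}\sqrt{\frac{A\log N}{n}} \vee \frac{U(L)\log N}{n} \\
&\vee \mathcal{E}^{1/2}(f_{\lambda^\varepsilon})\Bigl(\frac{d + A\log N}{n} \vee \sum_{j\notin J}\lambda_j^\varepsilon\sqrt{\frac{A\log N}{n}} \vee \max_{j\in J}\|P_{L^\perp}h_j\|_{L_2(\Pi)}\sqrt{\frac{A\log N}{n}} \vee \frac{U(L)\log N}{n}\Bigr)^{1/2}\Bigr],
\end{aligned} \tag{2.11}
\]

which reduces the problem to bounding only the approximation error.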

2.4. Approximation error bounds, alignment and oracle inequalities. To consider the approximation error we need several definitions. For $\lambda \in \Lambda$, denote

\[
T_\Lambda(\lambda) := \{v \in \mathbb{R}^N : \exists t > 0,\ \lambda + vt \in \Lambda\}.
\]

The set $T_\Lambda(\lambda)$ is the tangent cone of the convex set $\Lambda$ at the point $\lambda$. Recall that $H$ denotes the Gram matrix of the dictionary in the space $L_2(\Pi)$. Whenever it is convenient, $H$ will be viewed as a linear transformation of $\mathbb{R}^N$. For a vector $w \in \mathbb{R}^N$, let

\[
a_H(\Lambda, \lambda, w) := \sup\{\langle w, u\rangle_{\ell_2} : u \in T_\Lambda(\lambda),\ \|f_u\|_{L_2(\Pi)} = 1\}.
\]

We will call this quantity the alignment coefficient of vector $w$, matrix $H$ and convex set $\Lambda$ at point $\lambda \in \Lambda$. Note that

\[
\|f_u\|^2_{L_2(\Pi)} = \langle Hu, u\rangle_{\ell_2} = \langle H^{1/2}u, H^{1/2}u\rangle_{\ell_2}.
\]

Therefore, the alignment coefficient can be bounded as follows:

\[
a_H(\Lambda, \lambda, w) \le \sup\{\langle w, u\rangle_{\ell_2} : u \in \mathbb{R}^N,\ \|f_u\|_{L_2(\Pi)} = 1\}
= \sup_{\|H^{1/2}u\|_{\ell_2} = 1}\langle w, u\rangle_{\ell_2} =: \|w\|_H.
\]

If $H$ is nonsingular, we can further write

\[
\|w\|_H = \sup_{\|H^{1/2}u\|_{\ell_2} = 1}\langle H^{-1/2}w, H^{1/2}u\rangle_{\ell_2} = \|H^{-1/2}w\|_{\ell_2}.
\]

Even when $H$ is singular, we still have

\[
\|w\|^2_H \le \|H^{-1/2}w\|^2_{\ell_2},
\]

where for $w \in \operatorname{Im}(H^{1/2}) = H^{1/2}\mathbb{R}^N$, one defines

\[
\|H^{-1/2}w\|_{\ell_2} := \inf\{\|v\|_{\ell_2} : H^{1/2}v = w\}
\]

[which means factorization of the space with respect to $\operatorname{Ker}(H^{1/2})$] and for $w \notin \operatorname{Im}(H^{1/2})$ the norm $\|H^{-1/2}w\|_{\ell_2}$ becomes infinite. It is also easy to see that if $J = \operatorname{supp}(w)$, then

\[
\|w\|_H \le \frac{\|w\|_{\ell_2}}{\sqrt{\kappa(J)(1 - \rho^2(J))}} \le \frac{\|w\|_{\ell_\infty}\sqrt{d(J)}}{\sqrt{\kappa(J)(1 - \rho^2(J))}},
\]

where $d(J) := \operatorname{card}(J)$, $\kappa(J)$ is the minimal eigenvalue of the matrix

\[
H_J = \bigl(\langle h_i, h_j\rangle_{L_2(\Pi)}\bigr)_{i,j\in J}
\]

and

\[
\rho(J) := \sup\Bigl\{\frac{\langle f_1, f_2\rangle_{L_2(\Pi)}}{\|f_1\|_{L_2(\Pi)}\|f_2\|_{L_2(\Pi)}} : f_1 \in L_J,\ f_2 \in L_{J^c}\Bigr\},
\]

$L_J$ denoting the linear span of $\{h_j : j \in J\}$ [see Koltchinskii (2008a), the proof of Proposition 1, for a similar argument]. Measures of linear dependence similar to $\rho(J)$ are known in multivariate statistical analysis as "canonical correlations."
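A small numerical illustration of the norm $\|w\|_H = \|H^{-1/2}w\|_{\ell_2}$ may help here. The sketch below is restricted to a hypothetical dictionary of two functions with a nonsingular $2 \times 2$ Gram matrix, so that $H^{-1}$ can be written in closed form; the helper name `h_norm_2x2` and the numerical values are assumptions made for illustration.

```python
import math

def h_norm_2x2(w, H):
    """||w||_H = sqrt(w^T H^{-1} w) for a symmetric positive definite 2x2 matrix H."""
    (a, b), (c, d) = H
    det = a * d - b * c
    if a <= 0.0 or det <= 0.0:
        raise ValueError("H must be positive definite")
    # H^{-1} = (1 / det) * [[d, -b], [-c, a]]
    quad = (d * w[0] * w[0] - (b + c) * w[0] * w[1] + a * w[1] * w[1]) / det
    return math.sqrt(quad)

w = [1.0, 0.0]                      # vector supported on J = {1}
H_orthonormal = [[1.0, 0.0], [0.0, 1.0]]
rho = 0.6                           # correlation between h_1 and h_2 (hypothetical)
H_correlated = [[1.0, rho], [rho, 1.0]]

norm_orthonormal = h_norm_2x2(w, H_orthonormal)
norm_correlated = h_norm_2x2(w, H_correlated)
# Upper bound ||w||_2 / sqrt(kappa(J) (1 - rho(J)^2)) with kappa(J) = 1, rho(J) = rho.
upper_bound = 1.0 / math.sqrt(1.0 * (1.0 - rho * rho))
```

In the orthonormal case the norm reduces to the Euclidean norm of $w$, while for the correlated dictionary it grows by the factor $(1 - \rho^2)^{-1/2}$; in this two-function example the upper bound stated above is attained.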

These upper bounds show that the size of the alignment coefficient is controlled by the "sparsity" of the vector $w$ as well as by some characteristics of the dictionary (or its Gram matrix $H$). In particular, for orthonormal dictionaries and for dictionaries that are close enough to being orthonormal [so that $\kappa(J)$ is bounded away from 0 and $\rho^2(J)$ is bounded away from 1], the alignment coefficient is bounded from above by a quantity of the order $\|w\|_{\ell_\infty}\sqrt{d(J)}$. However, the alignment coefficient can be much smaller than this upper bound and it reflects in a more delicate way rather complicated geometric relationships between the vector $w$, the dictionary and the convex set $\Lambda$. Even the quantity $\|H^{-1/2}w\|^2_{\ell_2}$, which is a rough upper bound on the alignment coefficient that does not take into account the geometry of the set $\Lambda$, depends not only on the sparsity of $w$, but also on how this vector is aligned with the eigenspaces of $H$. For instance, if $w$ belongs to the linear span of the eigenspaces that correspond only to the eigenvalues of $H$ that are not too small, $\|H^{-1/2}w\|^2_{\ell_2}$ becomes of the order $\|w\|^2_{\ell_2}$. Note also that the geometry of the problem crucially depends on the unknown distribution $\Pi$ of the design variable [since one has to deal with the Hilbert space $L_2(\Pi)$].

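For $\lambda \in \mathbb{R}^N$, let $s_j^N(\lambda) := \log(eN^2\lambda_j)$, $j \in \operatorname{supp}(\lambda)$, and $s_j^N(\lambda) := 0$, $j \notin \operatorname{supp}(\lambda)$. Note that, for $j \in \operatorname{supp}(\lambda)$, $s_j^N(\lambda) = \log\lambda_j + 1 + 2\log N$ and $\log\lambda_j + 1$ is the derivative of the function $\lambda\log\lambda$ involved in the definition of the penalty. Let

\[
s^N(\lambda) := (s_1^N(\lambda), \dots, s_N^N(\lambda)).
\]

It happens that both the approximation error $\mathcal{E}(f_{\lambda^\varepsilon})$ and the "approximate sparsity" of $\lambda^\varepsilon$ can be controlled by the alignment coefficient of the vector $s^N(\lambda)$ for an arbitrary $\lambda \in \Lambda$. Denote

\[
\alpha_N(\lambda) := a_H(\Lambda, \lambda, s^N(\lambda)).
\]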
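The vector $s^N(\lambda)$ is straightforward to compute. The following minimal sketch (the function name `s_vector` and the example vector are illustrative, not from the paper) evaluates it and can be used to check the identity $s_j^N(\lambda) = \log\lambda_j + 1 + 2\log N$ on the support.

```python
import math

def s_vector(lam):
    """s^N(lambda): log(e N^2 lambda_j) on the support of lambda and 0 elsewhere."""
    N = len(lam)
    return [math.log(math.e * N * N * l) if l > 0.0 else 0.0 for l in lam]

lam_example = [0.5, 0.5, 0.0, 0.0]   # hypothetical sparse vector with N = 4
s_example = s_vector(lam_example)
```

Only the coordinates in $\operatorname{supp}(\lambda)$ are nonzero, so $s^N(\lambda)$ inherits the sparsity pattern of $\lambda$, which is what allows the sparsity-based bounds on the alignment coefficient to be applied to it.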

Theorem 4. There exists a constant $C > 0$ that depends only on $\ell$ and on the constant $M$ (for which $\|f_*\|_\infty \le M$) such that, for all $\varepsilon > 0$ and all $\lambda \in \Lambda$

(2.12) S(f X e) + 2e J2 ^<3£(fx) + c(e 2 a 2 N (X) + ^ 

j^supp(A) 

Theorem 4 and either of the bounds on the random error (2.10) and (2.11) immediately imply oracle inequalities for the excess risk $\mathcal{E}(f_{\hat\lambda^\varepsilon})$. For instance, the next corollary follows from (2.11).

Corollary 1. Under the conditions of Theorem 2, for all $\lambda \in \Lambda$ with $J = \operatorname{supp}(\lambda)$ and for all subspaces $L$ of $L_2(\Pi)$ with $d := \dim(L)$, the following bound holds with probability at least $1 - N^{-A}$ and with a constant $C$ depending on $\ell$ and on $M$:



' d + AlogN AlogN 
£(fX.)<4£(h) + C[ +max\\P L ±h j \\ La( B)d — - — 



+ 



U(L)\ogN 



n 



2.5. Density estimation and sparse mixtures recovery. In the case of density estimation based on entropy penalized empirical risk minimization with quadratic loss, as in (1.3), the results are rather similar to what was described above for prediction problems (regression and classification) and their proofs are quite similar, too.

Recall the notation introduced at the end of Section 1. Recall also the assumptions that the unknown density $f_*$ of the distribution $P$ is uniformly bounded by $M$ and the densities in the dictionary $h_j$ are uniformly bounded by 1.

The following results hold.

Theorem 5. There exist numerical constants $D > 0$ and $C > 0$ such that, for all $J \subset \{1, \dots, N\}$ with $d := d(J) = \operatorname{card}(J)$, for all $A \ge 1$ and for all



e> D\ 



' d + AlogN 



n 



the following bounds hold with probability at least 1 — N' 



-A. 



^HJ 



2 ,d + AlogN 
V n 



and 



\\h s -fx4l 2{ ,)+eK(\^X £ )<C 



M 2 d + M ^ N v £ A| J d + Al °Z N 



n 



HJ 



n 



Theorem 6. Suppose that 



e> D\ 



'AlogN 



n 






with a large enough numerical constant $D > 0$. For all $J \subset \{1, \dots, N\}$, for all subspaces $L$ of $L_2(P)$ with $d := \dim(L)$ and for all $A \ge 1$, the following bounds hold with probability at least $1 - N^{-A}$ and with a numerical constant $C > 0$:



(2.13) 



E A 5 + M^^±4^ + max||P L ,/ l ,|| i2(P) 



+ 



U{L)\ogN 



(2.14) 



and 



EAl + M2 rf± J 41ogiV +ma 



ns 



L± h j\\L 2 (P) 



+ 



U(L)logN 



us 



\\h.-fx4l M +zm £ ^ £ ) 

2 d + AlogN 



<C 



M 



n 



(2.15) 



V$>Ji 



A log N 



n 



Vm || Pl-l hj\\ L2 ( P )i 



V 



'AlogN 



n 



U(L)logN 



n 



In the case of density estimation, it makes sense to redefine the alignment coefficient in terms of the measure $\mu$:

\[
a_H(\Lambda, \lambda, w) := \sup\{\langle w, u\rangle_{\ell_2} : u \in T_\Lambda(\lambda),\ \|f_u\|_{L_2(\mu)} = 1\},
\qquad
\alpha_N(\lambda) := a_H(\Lambda, \lambda, s^N(\lambda)).
\]

The approximation error bound then becomes as follows.

Theorem 7. There exists a numerical constant $C > 0$ such that, for all $\varepsilon > 0$ and all $\lambda \in \Lambda$

ll/A«-/.|li a0i )+2 e £ A|<3||/ A -M|| 2(/i) 

j'gsupp(A) 



(2.16) 



+ C(^(A) + - 
Finally, this results in the following oracle inequality.

Corollary 2. Under the conditions of Theorem 6, for all $\lambda \in \Lambda$ with $J = \operatorname{supp}(\lambda)$ and for all subspaces $L$ of $L_2(P)$ with $d := \dim(L)$, the following bound holds with probability at least $1 - N^{-A}$ and with a numerical constant $C$:

\\f%e - f*\\ 2 L 2 (p) ^ 4 II/A-/*||1 2 ( AI ) 



n (^ 2 d + A\ogN AlogN 
+ C[M + m&x\\P L ±h j \\ L2{u) \ 

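3. Proofs. The proofs of Theorems 1 and 2 are quite similar. We give only the proof of Theorem 2.

Proof of Theorem 2. The following necessary conditions of minima in the minimization problems defining $\lambda^\varepsilon$ and $\hat\lambda^\varepsilon$ will be used to derive sparsity bounds:

\[
P(\ell' \bullet f_{\lambda^\varepsilon})(f_{\hat\lambda^\varepsilon} - f_{\lambda^\varepsilon}) + \varepsilon\sum_{j=1}^N(\log\lambda_j^\varepsilon + 1)(\hat\lambda_j^\varepsilon - \lambda_j^\varepsilon) \ge 0 \tag{3.1}
\]

and

\[
P_n(\ell' \bullet f_{\hat\lambda^\varepsilon})(f_{\hat\lambda^\varepsilon} - f_{\lambda^\varepsilon}) + \varepsilon\sum_{j=1}^N(\log\hat\lambda_j^\varepsilon + 1)(\hat\lambda_j^\varepsilon - \lambda_j^\varepsilon) \le 0. \tag{3.2}
\]

The inequality (3.1) holds because the directional derivative of the penalized risk function (which is convex)

\[
\Lambda \ni \lambda \mapsto P(\ell \bullet f_\lambda) + \varepsilon\sum_{j=1}^N\lambda_j\log\lambda_j
\]

at the point of its minimum $\lambda^\varepsilon$ is nonnegative in the direction of any other point of the convex set $\Lambda$, in this case in the direction of $\hat\lambda^\varepsilon$. The inequality (3.2) is based on similar considerations in the case of penalized empirical risk (note that in this case the minimum of the convex function is at $\hat\lambda^\varepsilon$ and we are differentiating in the direction to the minimal point, not from the minimal point). Subtracting (3.1) from (3.2) and replacing $P_n$ by $P$ in (3.2), we get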
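\[
P(\ell' \bullet f_{\hat\lambda^\varepsilon} - \ell' \bullet f_{\lambda^\varepsilon})(f_{\hat\lambda^\varepsilon} - f_{\lambda^\varepsilon}) + \varepsilon\sum_{j=1}^N(\log\hat\lambda_j^\varepsilon - \log\lambda_j^\varepsilon)(\hat\lambda_j^\varepsilon - \lambda_j^\varepsilon)
\le (P - P_n)(\ell' \bullet f_{\hat\lambda^\varepsilon})(f_{\hat\lambda^\varepsilon} - f_{\lambda^\varepsilon}). \tag{3.3}
\]

Note that

\[
\sum_{j=1}^N(\log\hat\lambda_j^\varepsilon - \log\lambda_j^\varepsilon)(\hat\lambda_j^\varepsilon - \lambda_j^\varepsilon) = \sum_{j=1}^N\log\Bigl(\frac{\hat\lambda_j^\varepsilon}{\lambda_j^\varepsilon}\Bigr)(\hat\lambda_j^\varepsilon - \lambda_j^\varepsilon) = K(\hat\lambda^\varepsilon, \lambda^\varepsilon),
\]

so bound (3.3) can be also written as

\[
P(\ell' \bullet f_{\hat\lambda^\varepsilon} - \ell' \bullet f_{\lambda^\varepsilon})(f_{\hat\lambda^\varepsilon} - f_{\lambda^\varepsilon}) + \varepsilon K(\hat\lambda^\varepsilon; \lambda^\varepsilon)
\le (P - P_n)(\ell' \bullet f_{\hat\lambda^\varepsilon})(f_{\hat\lambda^\varepsilon} - f_{\lambda^\varepsilon}). \tag{3.4}
\]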

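To extract from this bound some information about approximate sparsity of $\hat\lambda^\varepsilon$ note that

\[
\begin{aligned}
K(\hat\lambda^\varepsilon, \lambda^\varepsilon) &= \sum_{j=1}^N\log\Bigl(\frac{\hat\lambda_j^\varepsilon}{\lambda_j^\varepsilon}\Bigr)(\hat\lambda_j^\varepsilon - \lambda_j^\varepsilon)
\ge \sum_{j:\hat\lambda_j^\varepsilon\ge 2\lambda_j^\varepsilon}\log\Bigl(\frac{\hat\lambda_j^\varepsilon}{\lambda_j^\varepsilon}\Bigr)(\hat\lambda_j^\varepsilon - \lambda_j^\varepsilon) \\
&\ge \log 2\sum_{j:\hat\lambda_j^\varepsilon\ge 2\lambda_j^\varepsilon}(\hat\lambda_j^\varepsilon - \lambda_j^\varepsilon)
\ge \frac{\log 2}{2}\sum_{j:\hat\lambda_j^\varepsilon\ge 2\lambda_j^\varepsilon}\hat\lambda_j^\varepsilon.
\end{aligned} \tag{3.5}
\]

This implies that for any $J \subset \{1, \dots, N\}$

\[
\sum_{j\notin J}\hat\lambda_j^\varepsilon \le 2\sum_{j\notin J}\lambda_j^\varepsilon + \frac{2}{\log 2}K(\hat\lambda^\varepsilon, \lambda^\varepsilon). \tag{3.6}
\]

Similarly,

\[
\sum_{j\notin J}\lambda_j^\varepsilon \le 2\sum_{j\notin J}\hat\lambda_j^\varepsilon + \frac{2}{\log 2}K(\hat\lambda^\varepsilon, \lambda^\varepsilon). \tag{3.7}
\]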
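The inequalities (3.6) and (3.7) are purely deterministic statements about two points of the simplex, so they can be sanity-checked numerically. The Python sketch below draws random pairs of strictly positive probability vectors and verifies the bound with the constant $2/\log 2$ that follows from the chain (3.5). The helper names, the dimension, the set `J_demo` and the random seed are arbitrary choices made for this illustration.

```python
import math
import random

def sym_kl(a, b):
    """K(a, b) = sum_j (a_j - b_j) log(a_j / b_j) for strictly positive vectors."""
    return sum((x - y) * math.log(x / y) for x, y in zip(a, b))

def random_simplex_point(N, rng):
    """Random strictly positive probability vector (normalized exponential weights)."""
    w = [rng.expovariate(1.0) + 1e-12 for _ in range(N)]
    s = sum(w)
    return [x / s for x in w]

def tail_bound_holds(lam_hat, lam, J):
    """Check sum_{j not in J} lam_hat_j <= 2 sum_{j not in J} lam_j + (2 / log 2) K."""
    outside = [j for j in range(len(lam)) if j not in J]
    lhs = sum(lam_hat[j] for j in outside)
    rhs = 2.0 * sum(lam[j] for j in outside) + (2.0 / math.log(2.0)) * sym_kl(lam_hat, lam)
    return lhs <= rhs + 1e-12

rng = random.Random(0)
J_demo = {0, 1}
checks = []
for _ in range(1000):
    a = random_simplex_point(6, rng)
    b = random_simplex_point(6, rng)
    # Both directions: (3.6) and, by symmetry of K, (3.7).
    checks.append(tail_bound_holds(a, b, J_demo) and tail_bound_holds(b, a, J_demo))
```

Because $K(\cdot,\cdot)$ is symmetric in its arguments, the same function covers both (3.6) and (3.7) simply by swapping the roles of the two vectors.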

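Therefore, taking into account (3.4),

\[
\varepsilon\sum_{j\notin J}\hat\lambda_j^\varepsilon \le 2\varepsilon\sum_{j\notin J}\lambda_j^\varepsilon + \frac{2}{\log 2}(P - P_n)(\ell' \bullet f_{\hat\lambda^\varepsilon})(f_{\hat\lambda^\varepsilon} - f_{\lambda^\varepsilon}). \tag{3.8}
\]

Since the second derivative of the loss function is bounded away from 0, we also have

\[
P(\ell' \bullet f_{\hat\lambda^\varepsilon} - \ell' \bullet f_{\lambda^\varepsilon})(f_{\hat\lambda^\varepsilon} - f_{\lambda^\varepsilon}) \ge c\|f_{\hat\lambda^\varepsilon} - f_{\lambda^\varepsilon}\|^2_{L_2(\Pi)},
\]

where $c = \tau(1)$ (note that $\|f_{\hat\lambda^\varepsilon}\|_\infty \le 1$ and $\|f_{\lambda^\varepsilon}\|_\infty \le 1$). In view of (3.4), this implies

\[
c\|f_{\hat\lambda^\varepsilon} - f_{\lambda^\varepsilon}\|^2_{L_2(\Pi)} + \varepsilon K(\hat\lambda^\varepsilon, \lambda^\varepsilon) \le (P - P_n)(\ell' \bullet f_{\hat\lambda^\varepsilon})(f_{\hat\lambda^\varepsilon} - f_{\lambda^\varepsilon}). \tag{3.9}
\]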



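Denote

\[
\Lambda(\delta; \Delta) := \Bigl\{\lambda \in \Lambda : \|f_\lambda - f_{\lambda^\varepsilon}\|_{L_2(\Pi)} \le \delta,\ \sum_{j\notin J}\lambda_j \le \Delta\Bigr\},
\]

\[
\alpha_n(\delta; \Delta) := \sup\{|(P_n - P)((\ell' \bullet f_\lambda)(f_\lambda - f_{\lambda^\varepsilon}))| : \lambda \in \Lambda(\delta; \Delta)\}.
\]

The following two lemmas are somewhat akin to Lemma 5 in Koltchinskii (2008a). We will give below the proof of Lemma 2 that is needed to complete our proof of Theorem 2. Lemma 1 can be used in a similar way in the proof of Theorem 1, which we skip.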

Lemma 1. Under the assumptions of Theorem 1, there exists a constant $C$ that depends only on $\ell$ such that with probability at least $1 - N^{-A}$, for all

\[
n^{-1/2} \le \delta \le 1 \quad\text{and}\quad n^{-1/2} \le \Delta \le 1
\]

the following bounds hold:



a n (5;A)<p n (5;A):=C 



(3.10) 



d + AlogN v J d + AlogN 



n 



n 



U + AlogN AlogN 



n 



n 



Lemma 2. Under the assumptions of Theorem 2, there exists a constant $C$ that depends only on $\ell$ such that with probability at least $1 - N^{-A}$, for all

\[
n^{-1/2} \le \delta \le 1 \quad\text{and}\quad n^{-1/2} \le \Delta \le 1 \tag{3.11}
\]

the following bounds hold:

\[
\alpha_n(\delta; \Delta) \le \beta_n(\delta; \Delta)
\]



(3.12) 



:=C 



J? 



11 



hp All / ^log^ w U(L)logN „ AlogN 

Vmax WP^hj \\l 2 (ii)\ V V 

jeJ K ' \ n n n 
By Lemma 2 combined with (3.8) and (3.9), for

\[
\delta = \|f_{\hat\lambda^\varepsilon} - f_{\lambda^\varepsilon}\|_{L_2(\Pi)} \quad\text{and}\quad \Delta = \sum_{j\notin J}\hat\lambda_j^\varepsilon \tag{3.13}
\]

the following bounds hold with $\beta_n(\delta, \Delta)$ defined in (3.12):

\[
c\delta^2 \le \beta_n(\delta, \Delta) \tag{3.14}
\]

and

\[
\varepsilon\Delta \le 2\varepsilon\sum_{j\notin J}\lambda_j^\varepsilon + \frac{2}{\log 2}\beta_n(\delta, \Delta), \tag{3.15}
\]

provided that $\delta \ge n^{-1/2}$, $\Delta \ge n^{-1/2}$. If $\delta < n^{-1/2}$ or $\Delta < n^{-1/2}$, one can replace $\delta$ or $\Delta$, respectively, by $n^{-1/2}$ in the expression for $\beta_n(\delta, \Delta)$ and still have bounds (3.14) and (3.15). The proof below goes through in this case, even with some simplifications. In the main case, when $\delta \ge n^{-1/2}$, $\Delta \ge n^{-1/2}$, it remains to solve the inequalities (3.14), (3.15) to complete the proof. To this end, note that (3.15) can be rewritten (with a proper adjustment of constant $C$) as



eA < CA\ 



+ C 



'AlogN 



n 



'd + AlogN 



n 



I AlogN 



ii 



VmaxUPix/ijIli^n)' 



'A\ogN U{L)logN v ,41ogiV 



re 



71 



n 



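Under the assumption that the constant $D$ in the condition (2.3) on $\varepsilon$ is larger than 1, the term $\sum_{j\notin J}\lambda_j^{\varepsilon}\sqrt{\frac{A\log N}{n}}$ in the maximum can be dropped since it is smaller than the first term $\varepsilon\sum_{j\notin J}\lambda_j^{\varepsilon}$. If $D \ge 2C$, then $C\Delta\sqrt{\frac{A\log N}{n}} \le \varepsilon\Delta/2$, so the term involving $\Delta$ can be absorbed into the left-hand side, and the bound can be further rewritten as

$$\varepsilon\Delta \le C\biggl[\varepsilon\sum_{j\notin J}\lambda_j^{\varepsilon} \vee \delta\sqrt{\frac{d + A\log N}{n}} \vee \max_{j\in J}\|P_{L^{\perp}}h_j\|_{L_2(\Pi)}\sqrt{\frac{A\log N}{n}} \vee \frac{U(L)\log N}{n} \vee \frac{A\log N}{n}\biggr]$$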
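(again with an adjustment of $C$). To get a bound on $\Delta$, it is enough to solve the inequality separately for each term in the maximum and take the maximum of the solutions. This yields

$$\Delta \le C\biggl[\sum_{j\notin J}\lambda_j^{\varepsilon} \vee \frac{\delta}{\varepsilon}\sqrt{\frac{d + A\log N}{n}} \vee \max_{j\in J}\|P_{L^{\perp}}h_j\|_{L_2(\Pi)}\,\frac{1}{\varepsilon}\sqrt{\frac{A\log N}{n}} \vee \frac{U(L)\log N}{n\varepsilon} \vee \frac{A\log N}{n\varepsilon}\biggr].$$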



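Under the assumption (2.3) on $\varepsilon$ (with $D \ge 1$), this can be further simplified and the bound becomes

$$\Delta \le \Delta(\delta) := C\biggl[\sum_{j\notin J}\lambda_j^{\varepsilon} \vee \frac{\delta}{\varepsilon}\sqrt{\frac{d + A\log N}{n}} \vee \max_{j\in J}\|P_{L^{\perp}}h_j\|_{L_2(\Pi)} \vee \frac{U(L)\log N}{n\varepsilon} \vee \sqrt{\frac{A\log N}{n}}\biggr].$$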



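Let us now substitute $\Delta(\delta)$ instead of $\Delta$ in (3.14) [note that $\beta_n(\delta,\Delta)$ is nondecreasing in $\Delta$]. This easily gives the following bound on $\delta$:

$$\delta^2 \le C\biggl[\delta\sqrt{\frac{d + A\log N}{n}} \vee \frac{\delta}{\varepsilon}\sqrt{\frac{d + A\log N}{n}}\sqrt{\frac{A\log N}{n}} \vee \frac{U(L)\log N}{n\varepsilon}\sqrt{\frac{A\log N}{n}} \vee \sum_{j\notin J}\lambda_j^{\varepsilon}\sqrt{\frac{A\log N}{n}} \vee \max_{j\in J}\|P_{L^{\perp}}h_j\|_{L_2(\Pi)}\sqrt{\frac{A\log N}{n}} \vee \frac{U(L)\log N}{n} \vee \frac{A\log N}{n}\biggr]$$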

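and the second and the third terms in the maximum can be dropped again since $\frac{1}{\varepsilon}\sqrt{\frac{A\log N}{n}} \le 1$. Thus, we have

$$\delta^2 \le C\biggl[\delta\sqrt{\frac{d + A\log N}{n}} \vee \sum_{j\notin J}\lambda_j^{\varepsilon}\sqrt{\frac{A\log N}{n}} \vee \max_{j\in J}\|P_{L^{\perp}}h_j\|_{L_2(\Pi)}\sqrt{\frac{A\log N}{n}} \vee \frac{U(L)\log N}{n} \vee \frac{A\log N}{n}\biggr],$$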



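which gives the following bound on $\delta^2$:

$$\delta^2 \le C\biggl[\frac{d + A\log N}{n} \vee \sum_{j\notin J}\lambda_j^{\varepsilon}\sqrt{\frac{A\log N}{n}} \vee \max_{j\in J}\|P_{L^{\perp}}h_j\|_{L_2(\Pi)}\sqrt{\frac{A\log N}{n}} \vee \frac{U(L)\log N}{n}\biggr]. \qquad (3.16)$$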



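This can be substituted back into the expression for $\Delta(\delta)$ yielding the bound on $\Delta$:

$$\Delta \le C\biggl[\sum_{j\notin J}\lambda_j^{\varepsilon} \vee \frac{d + A\log N}{n\varepsilon} \vee \Bigl(\sum_{j\notin J}\lambda_j^{\varepsilon}\Bigr)^{1/2}\frac{1}{\varepsilon}\Bigl(\frac{A\log N}{n}\Bigr)^{1/4}\sqrt{\frac{d + A\log N}{n}} \vee \sqrt{\frac{U(L)\log N}{n\varepsilon}}\sqrt{\frac{d + A\log N}{n\varepsilon}}$$

$$\qquad \vee \max_{j\in J}\|P_{L^{\perp}}h_j\|_{L_2(\Pi)}^{1/2}\,\frac{1}{\varepsilon}\Bigl(\frac{A\log N}{n}\Bigr)^{1/4}\sqrt{\frac{d + A\log N}{n}} \vee \max_{j\in J}\|P_{L^{\perp}}h_j\|_{L_2(\Pi)} \vee \frac{U(L)\log N}{n\varepsilon} \vee \sqrt{\frac{A\log N}{n}}\biggr],$$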



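which, using the inequality $ab \le (a^2 + b^2)/2$ and the condition $\frac{1}{\varepsilon}\sqrt{\frac{A\log N}{n}} \le 1$, can be simplified and rewritten as

$$\Delta \le C\biggl[\sum_{j\notin J}\lambda_j^{\varepsilon} \vee \frac{d + A\log N}{n\varepsilon} \vee \max_{j\in J}\|P_{L^{\perp}}h_j\|_{L_2(\Pi)} \vee \frac{U(L)\log N}{n\varepsilon} \vee \sqrt{\frac{A\log N}{n}}\biggr] \qquad (3.17)$$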



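with a proper change of $C$ (still depending only on $\ell$). Now we can substitute (3.16) and (3.17) in the expression for $\beta_n(\delta,\Delta)$. We skip the details that are simple and similar to the bounds earlier in the proof. In view of Lemma 2, this gives the following bound on $\alpha_n(\delta,\Delta)$ that holds for $\delta$, $\Delta$ defined by (3.13) with probability at least $1 - N^{-A}$:

$$\alpha_n(\delta,\Delta) \le C\biggl[\frac{d + A\log N}{n} \vee \sum_{j\notin J}\lambda_j^{\varepsilon}\sqrt{\frac{A\log N}{n}} \vee \max_{j\in J}\|P_{L^{\perp}}h_j\|_{L_2(\Pi)}\sqrt{\frac{A\log N}{n}} \vee \frac{U(L)\log N}{n}\biggr].$$

Together with (3.9) this yields the bound

$$c\|f_{\hat\lambda^{\varepsilon}} - f_{\lambda^{\varepsilon}}\|_{L_2(\Pi)}^2 + \varepsilon K(\hat\lambda^{\varepsilon},\lambda^{\varepsilon}) \le C\biggl[\frac{d + A\log N}{n} \vee \sum_{j\notin J}\lambda_j^{\varepsilon}\sqrt{\frac{A\log N}{n}} \vee \max_{j\in J}\|P_{L^{\perp}}h_j\|_{L_2(\Pi)}\sqrt{\frac{A\log N}{n}} \vee \frac{U(L)\log N}{n}\biggr], \qquad (3.18)$$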
which is equivalent to (2.6). Bound (2.4) follows immediately from bound 



(3.17) (under the assumption on e, the term J " 41 ° gjV is smaller than 



Hog TV 



so, it can be discarded), and bound (2.5) follows from (3.7) and (3.18), which 
completes the proof. □ 
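The two elementary steps that the proof above uses when solving (3.14) and (3.15) can be written out explicitly. The following is a sketch in generic notation: the symbols $x$, $a$, $b$, $u$, $v$ are placeholders introduced here for illustration and are not part of the original argument.

```latex
% Step 1: solving a quadratic-type inequality, for x, a, b >= 0.
% Either x^2 <= C x a (so x <= C a) or x^2 <= C b, hence
\[
  x^2 \le C\,(x a \vee b)
  \quad\Longrightarrow\quad
  x^2 \le C^2 a^2 \vee C\,b .
\]
% Step 2: ab <= (a^2 + b^2)/2 combined with (1/eps) sqrt(A log N / n) <= 1,
% for u, v >= 0.
\[
  \sqrt{u}\cdot\frac{1}{\varepsilon}\Bigl(\frac{A\log N}{n}\Bigr)^{1/4}\sqrt{v}
  \;\le\; \frac{1}{2}\Bigl[u + \frac{1}{\varepsilon}\sqrt{\frac{A\log N}{n}}\cdot\frac{v}{\varepsilon}\Bigr]
  \;\le\; \frac{1}{2}\Bigl[u + \frac{v}{\varepsilon}\Bigr].
\]
```

With $x = \delta$ and $a = \sqrt{(d + A\log N)/n}$, Step 1 is the kind of passage that leads to (3.16). With $v = (d + A\log N)/n$, Step 2 is the kind of simplification that removes the mixed products on the way to (3.17).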

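Proof of Lemma 2. The proof relies on Talagrand's concentration inequality for empirical processes as well as on Rademacher symmetrization and contraction inequalities [see, e.g., Koltchinskii (2006) or Massart (2007) for their formulations in a form convenient for our purposes]. By Talagrand's concentration inequality, with probability at least $1 - e^{-t}$,

$$\alpha_n(\delta;\Delta) \le 2\biggl[\mathbb{E}\alpha_n(\delta;\Delta) + C\delta\sqrt{\frac{t}{n}} + \frac{Ct}{n}\biggr] \qquad (3.19)$$

and, by symmetrization inequality,

$$\mathbb{E}\alpha_n(\delta;\Delta) \le 2\,\mathbb{E}\sup\bigl\{|R_n((\ell'\bullet f_{\lambda})(f_{\lambda} - f_{\lambda^{\varepsilon}}))| : \lambda\in\Lambda(\delta;\Delta)\bigr\}.$$

Since

$$\ell'(f_{\lambda}(\cdot))(f_{\lambda}(\cdot) - f_{\lambda^{\varepsilon}}(\cdot)) = \ell'(f_{\lambda^{\varepsilon}}(\cdot) + u)\,u\,\big|_{u = f_{\lambda}(\cdot) - f_{\lambda^{\varepsilon}}(\cdot)}$$

and the function

$$[-1,1]\ni u \mapsto \ell'(f_{\lambda^{\varepsilon}}(\cdot) + u)\,u$$

is Lipschitz with a constant $C$ depending only on $\ell$, the application of Rademacher contraction inequality yields the bound

$$\mathbb{E}\alpha_n(\delta;\Delta) \le C\,\mathbb{E}\sup\bigl\{|R_n(f_{\lambda} - f_{\lambda^{\varepsilon}})| : \lambda\in\Lambda(\delta;\Delta)\bigr\}. \qquad (3.20)$$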

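Now we use the following representation

$$f_{\lambda} - f_{\lambda^{\varepsilon}} = P_L(f_{\lambda} - f_{\lambda^{\varepsilon}}) + \sum_{j\in J}(\lambda_j - \lambda_j^{\varepsilon})P_{L^{\perp}}h_j + \sum_{j\notin J}(\lambda_j - \lambda_j^{\varepsilon})P_{L^{\perp}}h_j. \qquad (3.21)$$

Clearly, for all $\lambda\in\Lambda(\delta,\Delta)$,

$$\|P_L(f_{\lambda} - f_{\lambda^{\varepsilon}})\|_{L_2(\Pi)} \le \|f_{\lambda} - f_{\lambda^{\varepsilon}}\|_{L_2(\Pi)} \le \delta$$

and $P_L(f_{\lambda} - f_{\lambda^{\varepsilon}}) \in L$, which is a $d$-dimensional subspace, so that

$$\mathbb{E}\sup\bigl\{|R_n(P_L(f_{\lambda} - f_{\lambda^{\varepsilon}}))| : \lambda\in\Lambda(\delta;\Delta)\bigr\} \le C\delta\sqrt{\frac{d}{n}}$$

[see, e.g., Koltchinskii (2006), Section 2, Example 1]. On the other hand, since $\lambda, \lambda^{\varepsilon}\in\Lambda$, we have $\sum_{j\in J}|\lambda_j - \lambda_j^{\varepsilon}| \le 2$ and

$$\mathbb{E}\sup\biggl\{\biggl|R_n\biggl(\sum_{j\in J}(\lambda_j - \lambda_j^{\varepsilon})P_{L^{\perp}}h_j\biggr)\biggr| : \lambda\in\Lambda(\delta;\Delta)\biggr\} \le 2\,\mathbb{E}\max_{j\in J}|R_n(P_{L^{\perp}}h_j)|.$$

We now proceed with a rather well-known approach to bounding the sup-norm of Rademacher sums: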



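$$\mathbb{E}\max_{j\in J}|R_n(P_{L^{\perp}}h_j)| \le C\,\mathbb{E}\max_{j\in J}\|P_{L^{\perp}}h_j\|_{L_2(\Pi_n)}\sqrt{\frac{\log\mathrm{card}(J)}{n}}$$

$$\qquad \le C\biggl[\max_{j\in J}\|P_{L^{\perp}}h_j\|_{L_2(\Pi)}\sqrt{\frac{\log N}{n}} + \sqrt{\mathbb{E}\max_{j\in J}\bigl|(\Pi_n - \Pi)(P_{L^{\perp}}h_j)^2\bigr|}\,\sqrt{\frac{\log N}{n}}\biggr].$$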



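Note that

$$\|P_{L^{\perp}}h_j\|_{\infty} \le (U(L) - 1)\|P_L h_j\|_{L_2(\Pi)} + 1 \le U(L).$$

We use symmetrization inequality together with Rademacher contraction inequality to get the following bound

$$\mathbb{E}\max_{j\in J}|R_n(P_{L^{\perp}}h_j)| \le C\biggl[\max_{j\in J}\|P_{L^{\perp}}h_j\|_{L_2(\Pi)}\sqrt{\frac{\log N}{n}} + \sqrt{\max_{j\in J}\|P_{L^{\perp}}h_j\|_{\infty}\,\mathbb{E}\max_{j\in J}|R_n(P_{L^{\perp}}h_j)|}\,\sqrt{\frac{\log N}{n}}\biggr].$$

The last inequality can be solved for $\mathbb{E}\max_{j\in J}|R_n(P_{L^{\perp}}h_j)|$, which gives the bound

$$\mathbb{E}\max_{j\in J}|R_n(P_{L^{\perp}}h_j)| \le C\biggl[\max_{j\in J}\|P_{L^{\perp}}h_j\|_{L_2(\Pi)}\sqrt{\frac{\log N}{n}} + \frac{U(L)\log N}{n}\biggr].$$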
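The step of "solving" a self-bounding inequality for the expected maximum can be made explicit. The sketch below uses placeholder symbols $x$, $a$, $b$ that are introduced here for illustration only; it assumes $x$ is finite and nonnegative, which holds for the expectation in question.

```latex
% Placeholder symbols (not in the original text):
%   x := E max_{j in J} |R_n(P_{L^perp} h_j)|,
%   a := max_{j in J} ||P_{L^perp} h_j||_{L_2(Pi)} sqrt(log N / n),
%   b := U(L) log N / n.
\[
  x \le C\bigl[a + \sqrt{b\,x}\bigr],
  \qquad
  C\sqrt{b\,x} = \sqrt{C^2 b}\,\sqrt{x} \le \frac{C^2 b + x}{2},
\]
\[
  \text{hence}\quad \frac{x}{2} \le C a + \frac{C^2 b}{2},
  \quad\text{that is,}\quad
  x \le 2C a + C^2 b .
\]
```

After renaming the constant, this is a bound of the form $C[a + b]$, with the sup-norm bound $U(L)$ entering only through the second-order term $b$.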
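Quite similarly, we have for all $\lambda\in\Lambda(\delta,\Delta)$

$$\sum_{j\notin J}|\lambda_j - \lambda_j^{\varepsilon}| \le \Delta + \sum_{j\notin J}\lambda_j^{\varepsilon}$$

and

$$\mathbb{E}\sup\biggl\{\biggl|R_n\biggl(\sum_{j\notin J}(\lambda_j - \lambda_j^{\varepsilon})P_{L^{\perp}}h_j\biggr)\biggr| : \lambda\in\Lambda(\delta;\Delta)\biggr\} \le \biggl(\Delta + \sum_{j\notin J}\lambda_j^{\varepsilon}\biggr)\mathbb{E}\max_{j\notin J}|R_n(P_{L^{\perp}}h_j)|.$$

Repeating what we have done in the case of $j\in J$, we get

$$\mathbb{E}\max_{j\notin J}|R_n(P_{L^{\perp}}h_j)| \le C\biggl[\sqrt{\frac{\log N}{n}} + \frac{U(L)\log N}{n}\biggr],$$

where we used the fact that

$$\|P_{L^{\perp}}h_j\|_{L_2(\Pi)} \le \|h_j\|_{L_2(\Pi)} \le 1.$$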

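It remains to recall representation (3.21) and bound (3.20) to show that

$$\mathbb{E}\alpha_n(\delta,\Delta) \le C\biggl[\delta\sqrt{\frac{d}{n}} \vee \Delta\sqrt{\frac{\log N}{n}} \vee \sum_{j\notin J}\lambda_j^{\varepsilon}\sqrt{\frac{\log N}{n}} \vee \max_{j\in J}\|P_{L^{\perp}}h_j\|_{L_2(\Pi)}\sqrt{\frac{\log N}{n}} \vee \Bigl(\Delta \vee \sum_{j\notin J}\lambda_j^{\varepsilon} \vee 1\Bigr)\frac{U(L)\log N}{n}\biggr], \qquad (3.22)$$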



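which can be bounded further as

$$\mathbb{E}\alpha_n(\delta,\Delta) \le C\biggl[\delta\sqrt{\frac{d}{n}} \vee \Delta\sqrt{\frac{\log N}{n}} \vee \sum_{j\notin J}\lambda_j^{\varepsilon}\sqrt{\frac{\log N}{n}} \vee \max_{j\in J}\|P_{L^{\perp}}h_j\|_{L_2(\Pi)}\sqrt{\frac{\log N}{n}} \vee \frac{U(L)\log N}{n}\biggr]. \qquad (3.23)$$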



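This can be plugged in (3.19) to get that with probability $1 - e^{-t}$

$$\alpha_n(\delta,\Delta) \le \beta_n(\delta,\Delta,t) := C\biggl[\delta\sqrt{\frac{d}{n}} \vee \Delta\sqrt{\frac{\log N}{n}} \vee \sum_{j\notin J}\lambda_j^{\varepsilon}\sqrt{\frac{\log N}{n}} \vee \max_{j\in J}\|P_{L^{\perp}}h_j\|_{L_2(\Pi)}\sqrt{\frac{\log N}{n}} \vee \frac{U(L)\log N}{n} \vee \delta\sqrt{\frac{t}{n}} \vee \frac{t}{n}\biggr] \qquad (3.24)$$

with a constant $C > 0$ depending only on $\ell$.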

We will make the above bound uniform in $\delta$, $\Delta$ satisfying (3.11). To this end, define

$$\delta_j := 2^{-j} \quad\text{and}\quad \Delta_j := 2^{-j}$$

and replace $t$ by $t + 2\log(j+1) + 2\log(k+1)$. The union bound implies that with probability at least

$$1 - \sum_{j,k\ge 0}\exp\{-t - 2\log(j+1) - 2\log(k+1)\} = 1 - \biggl(\sum_{j\ge 0}(j+1)^{-2}\biggr)^2\exp\{-t\} \ge 1 - 4e^{-t}$$

for all $\delta$ and $\Delta$ satisfying (3.11), and for $j, k$ such that

$$\delta\in(\delta_{j+1},\delta_j] \quad\text{and}\quad \Delta\in(\Delta_{k+1},\Delta_k],$$

the following bound holds:

$$\alpha_n(\delta;\Delta) \le \beta_n(\delta_j,\Delta_k,t + 2\log j + 2\log k).$$
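The numerical constant in the union bound can be checked directly: $\sum_{j\ge 0}(j+1)^{-2} = \pi^2/6$, so its square is $\pi^4/36 \approx 2.71$, comfortably below 4. The short Python sketch below illustrates this, together with the later choice $t = A\log N + \log 4$. The function names and the sample values of $A$ and $N$ are hypothetical and serve only as an illustration.

```python
import math


def union_bound_constant(terms: int = 100_000) -> float:
    """Partial sum approximating (sum_{j>=0} (j+1)^-2)^2 = (pi^2 / 6)^2."""
    s = sum(1.0 / (j + 1) ** 2 for j in range(terms))
    return s * s


def confidence_level_t(A: float, N: int) -> float:
    """The choice t = A log N + log 4, for which 4 * exp(-t) = N^(-A)."""
    return A * math.log(N) + math.log(4.0)


if __name__ == "__main__":
    # The partial sum stays below the exact limit pi^4 / 36, which is below 4.
    print(union_bound_constant(), math.pi ** 4 / 36)
    # Illustrative values A = 2, N = 50 (assumptions, not from the text).
    t = confidence_level_t(A=2.0, N=50)
    print(4.0 * math.exp(-t), 50 ** -2.0)
```

The slack between $\pi^4/36$ and 4 is harmless here, since the constant only affects the additive term $\log 4$ in $t$.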



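Since

$$2\log j \le 2\log\log_2\frac{1}{\delta} \le 2\log\log_2\frac{2}{\delta}$$

and

$$2\log k \le 2\log\log_2\frac{2}{\Delta},$$

we have

$$\beta_n(\delta_j,\Delta_k,t + 2\log j + 2\log k) \le \beta_n\Bigl(2\delta, 2\Delta, t + 2\log\log_2\frac{2}{\delta} + 2\log\log_2\frac{2}{\Delta}\Bigr) =: \bar\beta_n(\delta;\Delta;t)$$

and, therefore, with probability at least $1 - 4e^{-t}$, for all $\delta$ and $\Delta$ satisfying (3.11),

$$\alpha_n(\delta;\Delta) \le \bar\beta_n(\delta;\Delta;t).$$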
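Let $t = A\log N + \log 4$ (so that $4e^{-t} = N^{-A}$). Then, with some constant $C$ that depends only on $\ell$,

$$\bar\beta_n(\delta;\Delta;t) \le C\biggl[\delta\sqrt{\frac{d}{n}} \vee \delta\sqrt{\frac{A\log N}{n}} \vee \delta\sqrt{\frac{2\log\log_2(2/\delta)}{n}} \vee \delta\sqrt{\frac{2\log\log_2(2/\Delta)}{n}} \vee \Delta\sqrt{\frac{\log N}{n}} \vee \sum_{j\notin J}\lambda_j^{\varepsilon}\sqrt{\frac{\log N}{n}}$$

$$\qquad \vee \max_{j\in J}\|P_{L^{\perp}}h_j\|_{L_2(\Pi)}\sqrt{\frac{\log N}{n}} \vee \frac{U(L)\log N}{n} \vee \frac{2\log\log_2(2/\delta)}{n} \vee \frac{2\log\log_2(2/\Delta)}{n} \vee \frac{A\log N}{n}\biggr].$$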
For all $\delta$ and $\Delta$ satisfying (3.11),

$$\frac{2\log\log_2(2/\delta)}{n} \le C\,\frac{\log\log n}{n}$$

and

$$\frac{2\log\log_2(2/\Delta)}{n} \le C\,\frac{\log\log n}{n}.$$

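By assumptions on $N$, $n$, $A\log N \ge \gamma\log\log n$. Therefore, for $\delta$ and $\Delta$ satisfying (3.11),

$$\alpha_n(\delta,\Delta) \le \bar\beta_n(\delta;\Delta;t) \le C\biggl[\delta\sqrt{\frac{d}{n}} \vee \delta\sqrt{\frac{A\log N}{n}} \vee \Delta\sqrt{\frac{\log N}{n}} \vee \sum_{j\notin J}\lambda_j^{\varepsilon}\sqrt{\frac{\log N}{n}} \vee \max_{j\in J}\|P_{L^{\perp}}h_j\|_{L_2(\Pi)}\sqrt{\frac{\log N}{n}} \vee \frac{U(L)\log N}{n} \vee \frac{A\log N}{n}\biggr], \qquad (3.25)$$

which holds with probability at least $1 - N^{-A}$. □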

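Proof of Theorem 3. The proof easily follows from the fact that under the conditions on the loss function we have

$$(\ell\bullet f_{\hat\lambda^{\varepsilon}})(x,y) - (\ell\bullet f_{\lambda^{\varepsilon}})(x,y) = (\ell'\bullet f_{\lambda^{\varepsilon}})(x,y)(f_{\hat\lambda^{\varepsilon}} - f_{\lambda^{\varepsilon}})(x) + R(x,y),$$

where

$$|R(x,y)| \le C(f_{\hat\lambda^{\varepsilon}} - f_{\lambda^{\varepsilon}})^2(x).$$

Integrating with respect to $P$ yields

$$\bigl|P(\ell\bullet f_{\hat\lambda^{\varepsilon}}) - P(\ell\bullet f_{\lambda^{\varepsilon}}) - P(\ell'\bullet f_{\lambda^{\varepsilon}})(f_{\hat\lambda^{\varepsilon}} - f_{\lambda^{\varepsilon}})\bigr| \le C\|f_{\hat\lambda^{\varepsilon}} - f_{\lambda^{\varepsilon}}\|_{L_2(\Pi)}^2.$$

It is enough to observe that

$$\bigl|P(\ell'\bullet f_{\lambda^{\varepsilon}})(f_{\hat\lambda^{\varepsilon}} - f_{\lambda^{\varepsilon}})\bigr| = \bigl|\langle \ell'\bullet f_{\lambda^{\varepsilon}}, f_{\hat\lambda^{\varepsilon}} - f_{\lambda^{\varepsilon}}\rangle_{L_2(P)}\bigr| = \bigl|\langle P_{\mathcal{L}}(\ell'\bullet f_{\lambda^{\varepsilon}}), f_{\hat\lambda^{\varepsilon}} - f_{\lambda^{\varepsilon}}\rangle_{L_2(P)}\bigr| \le \|g_{\varepsilon}\|_{L_2(P)}\|f_{\hat\lambda^{\varepsilon}} - f_{\lambda^{\varepsilon}}\|_{L_2(\Pi)}$$

and to use the bounds of Theorems 1 and 2. □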
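For a concrete instance of the expansion used in this proof, consider the quadratic loss. This choice is an illustrative assumption made here and is not singled out in the argument above. With $u$ and $v$ standing for the values of two functions at a point $x$, the expansion is exact and the remainder constant equals 1:

```latex
% Quadratic loss (illustrative assumption): \ell(y,u) = (y-u)^2,
% so that \ell'(y,u) = -2(y-u), the derivative in the second variable.
\[
  \ell(y,v) - \ell(y,u)
  = (y-v)^2 - (y-u)^2
  = \underbrace{-2(y-u)}_{\ell'(y,u)}\,(v-u) + \underbrace{(v-u)^2}_{R},
  \qquad |R| \le C\,(v-u)^2 \ \text{with } C = 1 .
\]
% More generally, if the second derivative of \ell(y,\cdot) is bounded by 2C
% on the relevant range, Taylor's formula gives |R| \le C (v-u)^2.
```

Taking $u = f_{\lambda^{\varepsilon}}(x)$ and $v = f_{\hat\lambda^{\varepsilon}}(x)$ recovers the displayed decomposition in this special case.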

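Proof of Theorem 4. The following bound immediately follows from the definition of $\lambda^{\varepsilon}$: for all $\lambda\in\Lambda$,

$$\mathcal{E}(f_{\lambda^{\varepsilon}}) + \varepsilon\sum_{j=1}^{N}\lambda_j^{\varepsilon}\log(N^2\lambda_j^{\varepsilon}) \le \mathcal{E}(f_{\lambda}) + \varepsilon\sum_{j=1}^{N}\lambda_j\log(N^2\lambda_j).$$

Denoting $J_{\lambda} = \mathrm{supp}(\lambda)$ and using the convexity of the function $u\mapsto u\log(N^2u)$ and the fact that its derivative is $\log(eN^2u)$, we get

$$\mathcal{E}(f_{\lambda^{\varepsilon}}) + \varepsilon\sum_{j\notin J_{\lambda}}\lambda_j^{\varepsilon}\log(N^2\lambda_j^{\varepsilon}) \le \mathcal{E}(f_{\lambda}) + \varepsilon\sum_{j\in J_{\lambda}}\bigl(\lambda_j\log(N^2\lambda_j) - \lambda_j^{\varepsilon}\log(N^2\lambda_j^{\varepsilon})\bigr)$$

$$\qquad \le \mathcal{E}(f_{\lambda}) + \varepsilon\sum_{j\in J_{\lambda}}\log(eN^2\lambda_j)(\lambda_j - \lambda_j^{\varepsilon}),$$

which, by the definition of $\alpha_N(\lambda)$, can be further bounded by

$$\mathcal{E}(f_{\lambda}) + \varepsilon|\alpha_N(\lambda)|\,\|f_{\lambda} - f_{\lambda^{\varepsilon}}\|_{L_2(\Pi)}.$$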
Next we use obvious bounds [recall (2.9)] 



fx - /a s IU 2 (ii) < \\f\ ~ /*IU 2 (n) + ||/a s - /*IU 2 (n) < \/ — + \/ ^ ' 



to get 



S (f X s) +e £ A|log(iV 2 A|) < f (/a) + e|M A )l + \/^— ) 



Since 



and 



e|a^(A)| W — — < + -£(/a), 
this yields



\s{j x .)+e £ A|log(iV 2 A*) < \8{ h ) + ^M^. 



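Note that also

$$\sum_{j\notin J_{\lambda}}\lambda_j^{\varepsilon}\log(N^2\lambda_j^{\varepsilon}) = \sum_{j\notin J_{\lambda}:\,\lambda_j^{\varepsilon}\ge eN^{-2}}\lambda_j^{\varepsilon}\log(N^2\lambda_j^{\varepsilon}) + \sum_{j\notin J_{\lambda}:\,\lambda_j^{\varepsilon}< eN^{-2}}\lambda_j^{\varepsilon}\log(N^2\lambda_j^{\varepsilon}) \ge \sum_{j\notin J_{\lambda}:\,\lambda_j^{\varepsilon}\ge eN^{-2}}\lambda_j^{\varepsilon} - \frac{1}{eN},$$

where we used the fact that the function $t\mapsto t\log(N^2t)$ is bounded from below by $-e^{-1}N^{-2}$. Thus,

$$\sum_{j\notin J_{\lambda}}\lambda_j^{\varepsilon}\log(N^2\lambda_j^{\varepsilon}) \ge \sum_{j\notin J_{\lambda}}\lambda_j^{\varepsilon} - \sum_{j\notin J_{\lambda}:\,\lambda_j^{\varepsilon}< eN^{-2}}\lambda_j^{\varepsilon} - \frac{1}{eN} \ge \sum_{j\notin J_{\lambda}}\lambda_j^{\varepsilon} - \frac{e + e^{-1}}{N}.$$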

Therefore, we get 

S(M + 2eY: ^ < 3£(/a) + + 2(e + e" 1 )^, 

which implies the result. □ 
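The lower bound on the penalty used in this proof, namely that a sum of terms $\lambda_j\log(N^2\lambda_j)$ over at most $N$ indices is at least the sum of the $\lambda_j$ minus $(e + e^{-1})/N$, can be sanity-checked numerically. The Python sketch below does this on random points of the simplex. The helper names, the random test inputs, and the default parameter values are hypothetical and serve only as an illustration, not as part of the proof.

```python
import math
import random


def penalty_sum(lam, N):
    """Sum of lam_j * log(N^2 * lam_j) over the given weights (0 log 0 := 0)."""
    return sum(x * math.log(N * N * x) for x in lam if x > 0.0)


def claimed_lower_bound(lam, N):
    """Sum of lam_j minus (e + 1/e) / N, the lower bound used in the proof."""
    return sum(lam) - (math.e + 1.0 / math.e) / N


def random_weights(N, rng):
    """Random point of the simplex with widely spread coordinates (test input)."""
    w = [rng.expovariate(1.0) ** 4 for _ in range(N)]
    total = sum(w)
    return [x / total for x in w]


def check(N=50, trials=200, seed=0):
    """Return True if the bound held for every random subset that was tried."""
    rng = random.Random(seed)
    for _ in range(trials):
        lam = random_weights(N, rng)
        # A random subset of coordinates plays the role of the indices outside J.
        subset = [x for x in lam if rng.random() < 0.5]
        if penalty_sum(subset, N) < claimed_lower_bound(subset, N) - 1e-12:
            return False
    return True


if __name__ == "__main__":
    print(check())
```

The check relies on two facts from the argument: coordinates with $\lambda_j \ge eN^{-2}$ satisfy $\log(N^2\lambda_j) \ge 1$, and $t\mapsto t\log(N^2t)$ attains its minimum $-e^{-1}N^{-2}$ at $t = e^{-1}N^{-2}$.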



The results concerning penalized density estimation can be proved quite 
similarly. 